Working with spatial data in R

Published

January 26, 2025

Modified

January 26, 2025

In R, we usually represent spatial information in separate columns for latitude and longitude in a data frame. As said in Spatial Data Science (Pebesma and Bivand 2023),

It is often thought that spatial data boils down to having observations’ longitude and latitude in a dataset, and treating these just like any other variable. This carries the risk of missed opportunities and meaningless analyses.

While this approach is intuitive, structuring spatial data as specialized objects could be more interesting. For instance, summarizing data spatially — like aggregating observations within administrative boundaries — becomes way easier as we can perform spatial joins with different relations, such as intersections or containment, to combine data in a more straightforward way.

Let’s look at some of these features.

The data

The Building Complaints Dataset contains records of complaints received by the Department of Buildings (DOB) in New York City, related to housing and development. Each observation includes various attributes that provide context for the complaints. Here is a glimpse of the first rows:

 

The first thing we could do is to plot the distribution of the occurrences on a map:


You might wonder whether it’s really important to plot the exact location of each occurrence.

Perhaps a broader visualization would be more effective, such as aggregating the data by the city districts. Or even better, a neighborhood-level visualization would be even more informative!

Occurrences per borough

Occurrences per neighborhood


However, to achieve this, we need to address some challenges. While our dataset includes New York City districts in the borough column, their boundaries remain unknown. Additionally, we have no information on neighborhoods. Even if we did, we would still need a way to determine, based on the lat long, which points fall within each neighborhood.

At its core, we face two challenges:

  1. How to introduce boundaries in our data: what is the appropriate object to work with?
  2. How to perform spatial operations: how can we know a certain point on the map is on a determined neighborhood?

The sf package and the GeoJSON format

Standing for Simple Features, the sf package standardizes a way to encode spatial vector data, extending the concept of data frames by integrating geometry as a special column, preserving spatial relationships natively, such as points, lines, or polygons. In simple terms, a sf object is a data frame with an additional column that stores geometric data.

GeoJSON is a format for encoding a variety of geographic data structures, and supports the following geometry types:


You can load .geojson files in R using the function sf::read_sf. In our case, we downloaded a .geojson file containing the NYC neighborhoods tabulation areas directly from the NYC Open Data website. Here is how it looks:

neighborhoods <- sf::read_sf("files/nyc-neighborhoods.geojson")

neighborhoods |> select(neighborhood = ntaname, geometry)
Simple feature collection with 262 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS 84
# A tibble: 262 × 2
   neighborhood                                                         geometry
   <chr>                                                      <MULTIPOLYGON [°]>
 1 Greenpoint                          (((-73.93213 40.72816, -73.93238 40.7279…
 2 Williamsburg                        (((-73.95814 40.7244, -73.95772 40.72428…
 3 South Williamsburg                  (((-73.95024 40.70547, -73.94984 40.7052…
 4 East Williamsburg                   (((-73.92406 40.71411, -73.92404 40.7140…
 5 Brooklyn Heights                    (((-73.99103 40.69985, -73.99124 40.6993…
 6 Downtown Brooklyn-DUMBO-Boerum Hill (((-73.97906 40.70595, -73.97924 40.7055…
 7 Fort Greene                         (((-73.96284 40.69804, -73.96288 40.6978…
 8 Clinton Hill                        (((-73.96135 40.68046, -73.96329 40.6808…
 9 Brooklyn Navy Yard                  (((-73.97893 40.70593, -73.97893 40.7059…
10 Bedford-Stuyvesant (West)           (((-73.94414 40.67824, -73.94691 40.6783…
# ℹ 252 more rows

References

Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. Boca Raton: Chapman; Hall/CRC. https://doi.org/10.1201/9780429459016.